ABSTRACT

Objective To present a series of experiments: (1) to 
evaluate the impact of pre-annotation on the speed of 
manual annotation of clinical trial announcements; and 
(2) to test for potential bias, if pre-annotation is utilized. 
Methods To build the gold standard, 1400 clinical trial 
announcements from the clinicaltrials.gov website were 
randomly selected and double annotated for diagnoses, 
signs, symptoms, Unified Medical Language System 
(UMLS) Concept Unique Identifiers, and SNOMED CT 
codes. We used two dictionary-based methods to pre- 
annotate the text. We evaluated the annotation time 
and potential bias through F-measures and ANOVA tests 
and implemented Bonferroni correction. 
Results Time savings ranged from 13.85% to 21.5% 
per entity. Inter-annotator agreement (IAA) ranged from 
93.4% to 95.5%. There was no statistically significant 
difference for IAA and annotator performance in pre- 
annotations. 

Conclusions In every experiment pair, the annotator
with the pre-annotated text needed less time to 
annotate than the annotator with non-labeled text. The 
time savings were statistically significant. Moreover, the 
pre-annotation did not reduce the IAA or annotator 
performance. Dictionary-based pre-annotation is a 
feasible and practical method to reduce the cost of 
annotation of clinical named entity recognition in the 
eligibility sections of clinical trial announcements without 
introducing bias in the annotation process. 



OBJECTIVE 

Natural language processing (NLP) projects require 
manually annotated gold standard corpora to train 
and test supervised, machine learning-based algo- 
rithms or, in the case of rule-based methods, to test 
the performance of the rules. In light of the high 
cost of expert manual annotations, NLP researchers 
need robust methods to speed up the annotation 
process, without biasing the generated gold stand- 
ard. In our institution, we are working on an 
NIH-funded project to automate clinical trial eligi- 
bility screening by using NLP algorithms. This 
effort requires the development of a substantial 
manually annotated gold standard. As such, this 
annotation is very time-consuming and costly. 

In this study, our aim is to present a series of 
experiments: (1) to evaluate the impact of pre- 
annotation on the speed of manual annotation of 
clinical trial announcements (CTA); and (2) to test 
for potential bias, if pre-annotation is utilized. 



We define potential bias as either increasing the dis- 
crepancy between annotators measured by inter- 
annotator agreement (IAA) or decreasing the agree- 
ment (called annotator performance in our study) 
between the annotations of the annotator with pre- 
annotated text and the eventual gold standard. The 
annotation task included labeling medical named 
entities in two classes: disease/disorder and sign/ 
symptom. Unified Medical Language System 
(UMLS) Concept Unique Identifiers (CUI) and 
SNOMED-CT codes were also annotated for each
entity. 

The rest of the paper is structured as follows. In 
the 'Background and significance' section, we 
present relevant literature. In 'Data and methods', 
we describe the data, experimental methods, and 
analytical approaches. In the 'Results' section, we 
present the results. In the 'Discussion' section, we 
discuss the findings, limitations, and future research 
questions. In the final section, we provide our 
conclusions. 

BACKGROUND AND SIGNIFICANCE 

Pre-annotation has been studied widely in NLP tasks 
such as Named Entity Recognition (NER) (biomedical 1-4 and
astrophysical 5 domains), part of speech
(POS) tagging (Wall Street Journal 6-9 and medical 
literature 2 10 ), and Semantic Frame/Role Labeling. 11 
These approaches used some machine learning 
systems with varying sizes of training data. Some 
systems did active learning pre-annotation, incre- 
mentally training on iterative human input and pre- 
senting annotators with pre-annotated text, 2 5 8 
while others 4 10 relied on an existing tool such as 
MetaMap 12 to generate a pre-annotation set to 
apply to the whole text. 

Many applications for different domains have 
been built in order to semi-automatically annotate 
text as the user is working, updating future files 
with machine learning output based on previous 
annotations. 13-18 These efforts all seek to decrease 
annotation time, but in our study we focus on the 
role of a single pre-annotation set for particular 
named entities in the clinical domain. The main 
contribution of this study is in evaluating — in the 
clinical domain — if dictionary-based annotation 
sets provide substantial savings in time without 
biasing the annotation. 

Several studies evaluated the time savings of pre- 
annotation. Using Wall Street Journal text, Ringger 
et al 6 studied the cost considerations of generating 
a POS-tagged gold standard using many annotators and were
able to reduce, by half, the amount of time it took to annotate 
the same amount of data. They concluded that the hourly cost 
savings were partially dependent on the (self-rated) expert level 
of the annotator. 

In the biomedical domain, Ganchev et al 1 developed a semi- 
automated system to pre-annotate MEDLINE abstracts with a 
high-recall named entity tagger for gene mentions, and reported 
an astounding 75% reduction in time for the best tagger. In the 
clinical domain and using some of the same entity classes as our 
study, Ogren et al 4 used MetaMap to pre-annotate for disease 
and disorder. They reported a longer time for the pre-annotation 
set and doubted that there was any benefit in the pre-annotation 
method, citing spurious annotations that needed to be corrected. 
By building on these earlier works, we compare the performance 
and time savings from different annotation sets in the clinical 
domain. Our study is unique in the biomedical domain, as it eval- 
uates the statistical significance of the potential bias effect of pre- 
annotation in addition to time and cost savings. 

Machine learning-based pre-annotation is built upon training 
the model on a small amount of annotated text. A question 
remains whether the machine learning model is necessary in 
these cases, or whether a simple dictionary-based pre-annotation 
set is sufficient. Due to the initial smaller training set, the per- 
formance of a machine learning model is expected to be lower 
than a dictionary-based approach. We hypothesize that the 
dictionary-based approach might not have as many spurious 
results as Ogren et al's approach; consequently, the
dictionary-based pre-annotation will successfully reduce annotation time.

In evaluating the development of the Penn Treebank, Fort and
Sagot 8 compared the quality of pre-annotation (using different
POS taggers) and reported no significant difference in performance
(Krippendorff's α 19 ) between the two annotators, discounting
the concern that pre-annotation causes bias. Nor did Névéol et al 10
find bias from pre-annotation on semantic annotation of PubMed
queries.

Other than in some limited domain set tasks, such as surname 
recognition 3 or POS tagging, 20 no dictionary-based pre- 
annotation method has been studied. Although not a dictionary 
method, pre-annotation of dates based on regular expressions 
was used to help decrease the time per annotation in a protected 
health information de-identification task of clinical notes. 21 

DATA AND METHODS 

The annotation task in our study included annotating disease/ 
disorder and sign/symptom entities. We followed the annotation 
guidelines and schema from the SHARPn project. 22 The 
SHARPn guidelines find and normalize clinically relevant men- 
tions to Clinical Element Model templates, linking CUIs to men- 
tions and identifying attributes and modifiers. We employed two 
experienced annotators (henceforth referred to as A1 and A2)
with bachelor's degrees who had been trained using these
guidelines. One annotator had previous clinical expertise (as a
registered nurse) and a Bachelor of Science degree in Nursing.
Chapman et al 23 demonstrated that using both clinician and 
non-clinician annotators does not bias the annotated corpus, 
although non-clinicians need longer training time. The annota- 
tors were given access to the UMLS Terminology Services 
SNOMED CT 24 and Metathesaurus Browsers, in order to look 
up terms and assign CUIs and SNOMED-CT Codes (CODEs). 
The following is an example sentence from a CTA: 'Suspected 
of having lung cancer due to clinical symptoms, such as positive 
sputum cytology, hemoptysis, unresolved pneumonia, persistent 
cough...' A sample screenshot from the SNOMED-CT browser
while searching for lung cancer is shown in figure 1. 

Malignant tumor of lung is the best match for lung cancer and 
so the CODE (listed in the browser window as Concept: 
363358000) and CUI (C0242379) are annotated with the span 
lung cancer. The five entities in the sample sentence (lung
cancer, symptoms, hemoptysis, pneumonia, and persistent
cough) are all annotated with associated CUIs and CODEs, as
shown in figure 2.

Figure 1 UMLS Terminology Services SNOMED-CT browser: search for lung cancer.

Three entities belong to sign/symptom class and two are 
disease/disorder. Lab or test results (such as positive x-ray or posi- 
tive sputum cytology) were not annotated. The Protege plug-in 
Knowtator 25 was used for annotating the corpora. A screenshot 
from the program used to annotate is shown in figure 3. 

Data 

The CTA corpus for these experiments is composed of 1400 
CTAs randomly selected from the clinicaltrials.gov website 26 
(a total of 141 386 documents as of March 2013). We anno- 
tated only the eligibility criteria sections of the CTAs. One thou- 
sand of the 1400 CTAs were previously annotated 27 without 
pre-annotation for disease/disorder and sign/symptom. The 
1000 were split in half and randomly assigned to a control 
group and a dictionary generation set. The control and diction- 
ary generation sets are non-overlapping with the experiment 
sets. More detail is provided on the distribution of the remain- 
ing 400 CTAs into experiment sets in the 'Methods' section and 
in figure 4. The CTAs averaged 196.3 tokens per file, with an
average of 7.1 disease/disorder and sign/symptom entities per
100 tokens.

Methods 

The two annotators were given sets of CTAs (both non-labeled 
and pre-annotated) to annotate in the Knowtator program for 
disease/disorder and sign/symptom. The sample size was determined
based on the training size requirements of the machine
learning algorithms that utilized the annotated CTAs. The
underlying informatics projects provided the foundation for the 
exploratory pre-annotation experiments. The actual sample size 
is based not on the number of experiment CTAs (400) but on 
the units of analysis, namely the number of annotated entities 
and the number of tokens that the annotators read. Across all 
the control, dictionary, and experiment sets the annotators read 
almost 350 000 tokens (348 445) and annotated 19 002
medical named entities. 

For the non-labeled text, annotators were asked to annotate 
disease/disorder and sign/symptom entities, as described above. 
For a pre-annotated text, annotators were given the following 
choices: removing an annotation they thought was spurious; 
keeping or modifying said annotation; or adding an additional 
annotation. Figure 3 depicts the Knowtator program, with a set of 
pre-annotations on a particular CTA for an annotator to remove, 
correct, approve, or add a new annotation. In adjudication all dis- 
agreements and any remaining ambiguities were resolved. 



Pre-annotation procedure 

Whereas previous studies relied on machine learning output to
generate pre-annotation, we relied on a dictionary method in
our study (figure 4). We evaluated two dictionaries of different
sizes and origins, and each dictionary entry consisted of three
items: term, UMLS CUI, and SNOMED-CT code. The first dictionary
type was created by extracting annotations from the dictionary
generation set of 500 CTAs, as described in the Data
section. This dictionary is called the 'automated dictionary', as
it represents the automatically extracted set of all of the
annotations of the gold standard set. The CTA automated dictionary
contains 3414 diseases/disorders and 294 signs/symptoms.

Figure 2 Sample disease/disorder and sign/symptom entities.

The second dictionary type was created manually by the 
annotators, over several weeks. During the adjudication process 
of the double annotated, gold standard generation of the dic- 
tionary generation set of 500 CTAs, the annotators developed a 
list of what they determined to be common annotation decisions 
('manual dictionary'). The CTA manual dictionary contains 
522 disease/disorder entities and 47 signs/symptoms. 
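
As an illustration of how an automated dictionary of this kind might be extracted, the following is a minimal sketch and not the program used in the study. It assumes that gold standard annotations are available as (term, class, CUI, code) tuples, that terms are lowercased, and that a term annotated with several different codes keeps its most frequent one; none of these details are stated in the text. The function name build_dictionary and the placeholder identifiers are hypothetical; only the lung cancer CUI and code come from the example above.

```python
from collections import Counter, defaultdict


def build_dictionary(gold_annotations):
    """Map each lowercased term to its most frequent (class, CUI, code).

    Assumption: lowercasing and the 'most frequent wins' rule are
    illustrative choices, not details reported in the paper.
    """
    counts = defaultdict(Counter)
    for term, entity_class, cui, code in gold_annotations:
        counts[term.lower()][(entity_class, cui, code)] += 1
    return {term: c.most_common(1)[0][0] for term, c in counts.items()}


# Illustrative input; the second CUI/code pair is a placeholder.
GOLD = [
    ("Lung cancer", "disease/disorder", "C0242379", "363358000"),
    ("lung cancer", "disease/disorder", "C0242379", "363358000"),
    ("hemoptysis", "sign/symptom", "CUI_PLACEHOLDER", "CODE_PLACEHOLDER"),
]

dictionary = build_dictionary(GOLD)
```

Each resulting entry carries the three items described above (term, UMLS CUI, and SNOMED-CT code) plus the entity class.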

We used regular expression matching to pre-annotate the text, 
with the dictionary terms as input (see figure 4, Experiments). 
The list of matches and their offsets was imported into 
Knowtator in order to assign the class labels for each term. We 
wrote a program to assign the UMLS CUIs and SNOMED-CT 
codes to the pre-annotated terms. Table 1 shows the number of 
dictionary matches for each experiment set. 
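
The matching step described above can be sketched as follows. This is a hedged illustration rather than the authors' program: it assumes case-insensitive matching on complete tokens (as the Discussion describes the lookup), prefers longer terms when several dictionary terms could match, and uses hypothetical names (DictEntry, Match, pre_annotate). The hemoptysis identifiers are placeholders; the lung cancer CUI and code are those given in the example sentence.

```python
import re
from typing import NamedTuple


class DictEntry(NamedTuple):
    term: str
    cui: str
    code: str
    entity_class: str


class Match(NamedTuple):
    start: int
    end: int
    text: str
    entry: DictEntry


def compile_dictionary(entries):
    """Build one regular expression from all dictionary terms."""
    # Longest terms first, so a multi-word term wins over a shorter one.
    terms = sorted({e.term for e in entries}, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in terms)
    # \b limits matches to complete tokens; IGNORECASE ignores capitalization.
    return re.compile(r"\b(?:%s)\b" % alternation, re.IGNORECASE)


def pre_annotate(text, entries):
    """Return character offsets and dictionary entries for every match."""
    lookup = {e.term.lower(): e for e in entries}
    pattern = compile_dictionary(entries)
    return [
        Match(m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


ENTRIES = [
    DictEntry("lung cancer", "C0242379", "363358000", "disease/disorder"),
    # Placeholder identifiers, for illustration only.
    DictEntry("hemoptysis", "CUI_PLACEHOLDER", "CODE_PLACEHOLDER", "sign/symptom"),
]
SAMPLE = "Suspected of having Lung Cancer due to clinical symptoms, such as hemoptysis"
matches = pre_annotate(SAMPLE, ENTRIES)
```

The returned offsets play the role of the list of matches that is imported into the annotation tool, and the attached entry supplies the class label, CUI, and code for each pre-annotated term.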

We split the text for each experiment into two sets, Set1 and
Set2. A1 received non-labeled text in Set1 and pre-annotated
text in Set2; A2 received pre-annotated text in Set1 and
non-labeled text in Set2. Table 1 details each of these sets, as
follows: the total number of entities for each set; the annotator
who had the pre-annotated set (A1 or A2); the dictionary type
that was used for the pre-annotation (manual vs automated); 
and the hypothesis tested in the experiment. For the dictionary 
and control sets, the number of entities shown is the number of 
entities in the gold standard. For the experiment sets, the 
number is the result of pre-annotation (the number of entities 
given to the annotator with pre-annotation). Figure 4 also 
details the study design for the experiments. 

Experiments 

There are two experiments (labeled: 1, 2 in table 1). As shown 
in figure 4, each experiment is split into two sub-experiment 
sets (eg, 1.1, 1.2; 2.1, 2.2). The first document set (listed as 
'Dictionary') includes 500 traditionally-annotated, gold standard 
CTAs and is the source of pre-annotation terms for experiments 
1 and 2. The second document set (listed as 'Control') includes 
500 traditionally-annotated, gold standard CTAs. The experi- 
ment sets 1 and 2 comprise the remaining 400 CTAs for experi- 
ments. Each experiment set was double annotated and 
adjudicated for a final gold standard. 

Experiment 1 

A1 was given 100 non-labeled CTAs in set 1.1 and 100 pre-annotated
CTAs in set 1.2. A2 was given pre-annotated CTAs in
set 1.1 and non-labeled CTAs in set 1.2. The purpose of this 
experiment was to evaluate the potential bias of the CTA 
manual dictionary pre-annotation on the annotator and poten- 
tial pre-annotation time savings using terms for pre-annotation 
that were collected by the annotators in their earlier CTA anno- 
tation projects. 

Experiment 2 

The purpose of this experiment was to evaluate the potential 
bias of the CTA automated dictionary pre-annotation on the 
annotator and potential pre-annotation time savings. A1 was
given 100 non-labeled CTAs in set 2.1 and 100 pre-annotated
CTAs in set 2.2. A2 was given pre-annotated CTAs in set 2.1
and non-labeled CTAs in set 2.2. The purpose of doing both
experiments 1 and 2 is to compare how different pre-annotation
dictionary types (automated and manual in the CTA corpus)
affect the IAA, performance relative to the eventual gold standard
(resulting from the adjudication process), and potential time
savings.

Figure 3 Pre-annotated clinical trial announcement text in Knowtator.



RESULTS 

Measuring annotator bias 

By comparing the IAA for each set in an experiment (eg, experi- 
ment 1: sets 1.1 and 1.2), we looked for potential bias caused 
by annotating text with pre-annotation. The F-measure (equa- 
tion 3) calculated is the harmonic mean between precision 
(equation 1) and recall (equation 2). 

$$P = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \qquad (1)$$

$$R = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \qquad (2)$$

$$F = \frac{2 \times P \times R}{P + R} \qquad (3)$$



The IAA measures the agreement between the two annotators by
temporarily treating one annotator (eg, A1) as the gold standard
and calculating the F-measure for the other annotator
(eg, A2). 28 When we report on the F-measure IAA, we list only
one per class because the F-measure is identical for each annotator
(A1's precision relative to A2 is A2's recall relative to A1).
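
The symmetry noted above can be checked with a short sketch. It assumes that an annotation counts as a true positive only when its span and class match exactly; the matching criterion is not specified in the text, so this is an illustrative assumption, and the offsets below are invented for the example.

```python
def precision_recall_f(reference, candidate):
    """Score `candidate` against `reference` using equations 1 to 3.

    Both arguments are sets of (start, end, entity_class) tuples.
    Assumption: only exact span-and-class matches are true positives.
    """
    tp = len(reference & candidate)
    fp = len(candidate - reference)
    fn = len(reference - candidate)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Invented annotations for two annotators on the same file.
a1 = {(20, 31, "DD"), (40, 50, "SS"), (60, 69, "DD")}
a2 = {(20, 31, "DD"), (40, 50, "SS"), (75, 80, "SS"), (90, 95, "DD")}

p_a2, r_a2, f_a2 = precision_recall_f(a1, a2)  # A1 treated as gold standard
p_a1, r_a1, f_a1 = precision_recall_f(a2, a1)  # A2 treated as gold standard
```

Swapping which annotator is treated as the gold standard exchanges precision and recall, so the harmonic mean, and therefore the reported IAA, does not change.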



Measuring individual annotators' distance from adjudicated 
gold standard 

After the double annotation of each experiment set, the annotators 
met in adjudication (under the supervision of one of the investiga- 
tors) and came to an agreement on a final gold standard. An 
F-measure was calculated for each annotator, relative to the gold 
standard for each entity class (disease/disorder and sign/symptom) 
for that set. This is what we are calling the annotator's performance. 

Comparing the performance between the annotator who 
received non-labeled text and the annotator who received pre- 
annotated text, within a single experiment set (eg, 1.1), helps to 
show any potential biasing effect that pre-annotation has on the 
annotators' performances, relative to the gold standard. We can 
also compare the same annotator's (eg, A1) annotation speed, in
the experiment set with non-labeled text (eg, 1.1), with the 
experiment set with pre-annotated text (1.2). 

The impact of pre-annotation on annotation speed is measured 
for the same annotator across sub-experiments (eg, 1.1 vs 1.2), 
using the same corpus and dictionary approach, while the impact 
of pre-annotation on creating bias is measured between annotators
within sub-experiments (eg, A1 vs A2 in 1.1 and then again
in 1.2). The experiments are repeated two times, for dictionary 
differences (manual vs automated). In addition, there is a further 
multiplying factor of two based on which annotator is getting 
pre-annotated text. Altogether there are four sub-experiments (as 
shown in table 1) to control for dictionary and pre-annotation. 



Statistical analysis 

We performed one-way analysis of variance (ANOVA) on nine
variables: A1 F-measure against the gold standard, for
disease/disorder annotation; A2 F-measure against the gold standard,
for disease/disorder annotation; A1 F-measure against the gold
standard, for sign/symptom annotation; A2 F-measure against
the gold standard, for sign/symptom annotation; A1 versus A2
IAA, for disease/disorder annotation; A1 versus A2 IAA, for
sign/symptom annotation; number of class entities, tokens, and
CUI/CODE entities.

Figure 4 Experiment study design.
A1, annotator 1; A2, annotator 2; CTA, clinical trial announcements; DD, disease/disorder; SS, sign/symptom.

Table 2 IAA and annotator performance

Experiment set   IAA (%)   Performance A1 (%)   Performance A2 (%)
1.1              95.5      98.8                 96.4
1.2              93.4      98.2                 95.2
2.1              93.7      97.0                 96.0
2.2              94.7      97.0                 96.9

A1, annotator 1; A2, annotator 2; IAA, inter-annotator agreement.

The purpose of performing an ANOVA test on each of these 
variables was to determine if the variance between files and 
annotators was statistically significant. Due to the number of dif- 
ferent tests conducted, we applied a very conservative 
Bonferroni correction to account for the increased possibility of 
type I error. Thus, to adjust for nine different significance tests 
with multiple variables that may not be independent, 29 findings 
were considered statistically significant at p<0.0001.
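
To show how conservative this threshold is, the following sketch compares it with the textbook Bonferroni threshold, which divides the family-wise significance level by the number of tests. The family-wise level of 0.05 is an assumption made for illustration; the text does not state which level the correction started from.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


ALPHA = 0.05               # assumption: conventional family-wise level
N_TESTS = 9                # nine variables were tested
STUDY_THRESHOLD = 0.0001   # threshold reported in the text

standard = bonferroni_threshold(ALPHA, N_TESTS)  # roughly 0.0056
is_stricter = STUDY_THRESHOLD < standard
```

Under that assumption, the threshold used in the study is considerably stricter than 0.05/9, which is consistent with the description of the correction as very conservative.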

Comparisons were made both within experiment pairs (1.1 vs 1.2,
etc.) and between each experiment set and the control documents
(1.1 vs Control, 1.2 vs Control, etc.).

To test whether the IAA between annotators or each annotator's
performance was significantly different among experiment sets, we
calculated the F-measures per file. These per-file F-measures
were compared using a one-way ANOVA test for statistical
significance among the different experimental groups. Table 2
shows the per-set averages of the F-measures for IAA and
annotator performance.
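
For readers unfamiliar with the test, a one-way ANOVA on per-file F-measures can be sketched as below. This is a plain-Python illustration of the F statistic only, not the statistical software used in the study, and the per-file values are invented. Converting the statistic to a p value requires the F distribution, which a statistics package such as SciPy (scipy.stats.f_oneway) provides.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual files around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Invented per-file F-measures for a control set and one experiment set.
control_files = [0.95, 0.97, 0.96, 0.94]
experiment_files = [0.96, 0.98, 0.95, 0.97]
f_stat, df_b, df_w = one_way_anova_f([control_files, experiment_files])
```

A large F statistic relative to the F distribution with the returned degrees of freedom would indicate that the group means differ by more than the file-to-file variation explains.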

Time savings 

To test for time savings in annotation for each set, we recorded 
the annotation times and compared them to evaluate the effect 
of pre-annotation. 

Table 3 displays the time savings for pre-annotated text over
non-labeled text. For each experiment set, the amount of time
needed to annotate and the calculated time savings of annotating
with pre-annotated text are indicated. Also included is the
average of both sets of each experiment. For example, set 1.1
took 17.7 h for pre-annotated text and 20.5 h for non-labeled
text. This represents an overall time savings of 13.9% for A2,
who had pre-annotated text. Also in set 1.1, A2 took an average
of 45.4 s per entity with pre-annotated text, while A1 took an
average of 7.3 s longer per entity with non-labeled text. The
average across the two sets of experiment 1 was 16.6% for both
overall time and per-entity time savings. The greatest overall
time savings is in set 2.2 (automated dictionary pre-annotation)
with 21.5%; the two sets of experiment 2 averaged 20.8%. A paired
t test shows that the time savings in each experiment set were
significant (p<0.01).
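
The percentage savings can be recomputed from the per-entity times in table 3 with the sketch below. Because the table reports rounded times, the recomputed percentages may differ from the published ones in the last digit; the function name percent_saved is introduced here only for illustration.

```python
def percent_saved(pre_annotated, non_labeled):
    """Time saved with pre-annotated text, as a percentage of non-labeled time."""
    return 100.0 * (non_labeled - pre_annotated) / non_labeled


# Seconds per entity from table 3: (pre-annotated text, non-labeled text).
PER_ENTITY_SECONDS = {
    "1.1": (45.4, 52.7),
    "1.2": (34.9, 43.3),
    "2.1": (30.5, 38.2),
    "2.2": (28.7, 36.6),
}

savings = {name: percent_saved(pre, non) for name, (pre, non) in PER_ENTITY_SECONDS.items()}
experiment_1_mean = (savings["1.1"] + savings["1.2"]) / 2
experiment_2_mean = (savings["2.1"] + savings["2.2"]) / 2
```

For set 1.1 this gives 7.3/52.7, or about 13.85%, which matches the lower bound of the range reported in the abstract.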

Comparisons for statistical significance 

Each experiment set has three F-measures (agreement between 
the annotators and performance for each annotator). The per- 
formance reported is the combined class F-value, which is the 
F-measure for both classes of disease/disorder and sign/ 
symptom; these are listed in table 2. Table 4 lists the p values 
from the results of the ANOVA comparisons for each experi- 
ment pair. The purpose of this comparison is to examine if 
there is a significant difference in an annotator's performance 
when receiving pre-annotated or non-labeled text. The annota- 
tor F-measures are separated according to entity class. The 
control 500 CTA set, where no pre-annotation occurred, pro- 
vides a set for comparison. In each column, an experiment set is 
compared against the control set. In the first column the pooled 
CTA text sets (1.1-2.2) were compared against the control set. 
In the second column, the set 1.1 was compared, and so on. 

The results in table 4 show that when annotating signs and
symptoms, the annotators' performance and IAA are significantly
different from those of the control set at the Bonferroni-corrected
p<0.0001 level. This holds for experiment sets 1.1 and 1.2 only.
None of the other comparisons shows a statistically
significant difference.

Table 4 also lists the p value for each variable in intra- 
experiment ANOVA comparisons. There is no statistically signifi- 
cant difference between manual and automated experiments. 

DISCUSSION 
Time savings 

In every experiment pair, the annotator with the pre-annotated 
text took less time to annotate than the annotator with non- 
labeled text. This illustrates a clear time savings and, unlike 
other studies, 4 spurious annotations in the pre-annotation set 
did not affect the annotator's performance. The time saved in 
each experiment was significant (p<0.01). The time savings 
result in part from reducing the amount of time an annotator 
has to look up entities to match with the UMLS terminology 
database (see figure 1). 

Table 3 Overall and per entity time savings

                 Overall time (hours)                                            Time per entity (seconds)
Experiment set   Pre-annotated text   Non-label   % Saved   Average per          Pre    Non-label   % Saved   Average per          p Value
                                                            experiment (%)                                    experiment (%)
1.1              17.7                 20.5        13.9                           45.4   52.7        13.9                           <0.01
1.2              14.3                 17.7        19.3      16.6                 34.9   43.3        19.3      16.6                 <0.01
2.1              14                   17.5        20.0                           30.5   38.2        20.0                           <0.01
2.2              14.25                18.2        21.5      20.8                 28.7   36.6        21.5      20.8                 <0.01

Table 4 Statistical significance of experiments

Statistical significance of experiments 1-2 (CTA)

               CTA vs 500*   1.1 vs 500   1.2 vs 500   2.1 vs 500   2.2 vs 500
A1 vs GS (D)   0.37          0.38         0.43         0.44         0.08
A1 vs GS (S)   0.42          <0.0001      <0.0001      0.3          0.98
A2 vs GS (D)   0.36          0.14         0.01         0.22         0.11
A2 vs GS (S)   0.38          <0.0001      <0.0001      0.57         0.01
IAA (D)        0.34          0.96         0.24         0.54         0.95
IAA (S)        0.06          <0.0001      <0.0001      0.38         0.73
Code_Ent       0.35          0.14         0.28         0.48         0.22
DS_Ent         0.45          0.13         0.03         0.98         0.16
Tokens         0.4           0.06         0.03         0.11         0.15

Intra-experiment significance

               1.1 vs 1.2   2.1 vs 2.2
A1 vs GS (D)   0.2          0.06
A1 vs GS (S)   0.47         0.76
A2 vs GS (D)   0.45         0.83
A2 vs GS (S)   0.43         0.57
IAA (D)        0.12         0.48
IAA (S)        0.29         0.46

* Control for CTA.
A1, annotator 1; A2, annotator 2; CTA, clinical trial announcements;
D, disease/disorder; GS, gold standard; IAA, inter-annotator agreement;
S, sign/symptom.
Values shown as <0.0001 are statistically significant at p<0.0001.

The automated dictionary pre-annotation experiment (2.1/2.2)
shows greater per-entity time savings, compared to the manual
dictionary experiment sets (20.8% time savings vs 16.6%).
The reduced time savings of the manual dictionary-based
pre-annotation set versus the automated may be due to a lack of
coverage, since the automated dictionary contained more than six
times the total entries (3708 vs 569). During adjudication we
learned that many of the smaller abbreviations (eg, 'ms' (multiple
sclerosis), 'all' (acute lymphoblastic lymphoma)) produce spurious
annotations that cost time in removing. The pre-annotation
program performed the lookup and annotated without regard to
capitalization, matching complete tokens only. For example, the
abbreviation MS would match both 'ms' and 'MS', but not 'aims'.
Modifications to the pre-annotation program could be developed
to allow shorter (two to three letter) abbreviations to be case
sensitive and further increase time savings for pre-annotated tasks.
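
One possible form of the modification suggested above is sketched here. It is a hypothetical variant, not something implemented in the study: it assumes that abbreviations are stored in the dictionary in their usual capitalized form and that any term of three characters or fewer is treated as a short abbreviation. The names SHORT_LIMIT, term_pattern, and find_matches are introduced only for this illustration.

```python
import re

SHORT_LIMIT = 3  # assumption: terms of up to three characters are "short"


def term_pattern(term):
    """Whole-token pattern; short terms are matched case sensitively."""
    flags = 0 if len(term) <= SHORT_LIMIT else re.IGNORECASE
    return re.compile(r"\b" + re.escape(term) + r"\b", flags)


def find_matches(text, terms):
    """Return sorted (start, end, matched text) tuples for all terms."""
    hits = []
    for term in terms:
        for m in term_pattern(term).finditer(text):
            hits.append((m.start(), m.end(), m.group(0)))
    return sorted(hits)


TERMS = ["MS", "ALL", "multiple sclerosis"]
TEXT = "The study aims to enroll all patients with MS or Multiple Sclerosis in 30 ms."
hits = find_matches(TEXT, TERMS)
```

In this example the ordinary word 'all' and the unit 'ms' are no longer pre-annotated, while 'MS' and the longer term 'Multiple Sclerosis' still are. The trade-off is that a lowercase use of a genuine abbreviation would be missed and left for the annotator to add.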

To put the time savings in perspective, a 3-month (60 work days)
annotation project can be reduced to as little as 48 days when using
an automatically generated pre-annotation dictionary on CTAs.
When using a manually created pre-annotation dictionary on the
same corpus, the 60 days can be reduced to 50. For projects that
implement double annotation, the saved labor is twice the
saved time, as 10 days of saved annotation time saves 20 days of labor.
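
The arithmetic behind these projections is shown in the short sketch below, using the average savings per experiment from table 3 (20.8% for the automated dictionary and 16.6% for the manual dictionary). Rounding to whole days is an assumption about how the figures in the text were obtained.

```python
def projected_days(baseline_days, percent_saved):
    """Annotation days remaining after applying a percentage time saving."""
    return baseline_days * (1 - percent_saved / 100.0)


BASELINE_DAYS = 60  # a 3-month project of 60 work days

automated_days = projected_days(BASELINE_DAYS, 20.8)  # about 47.5 days
manual_days = projected_days(BASELINE_DAYS, 16.6)     # about 50.0 days

# With double annotation, each saved day is saved by two annotators.
manual_labor_saved = 2 * (BASELINE_DAYS - manual_days)
```

Rounded to whole days, these values correspond to the 48 and 50 days quoted above, and to roughly 20 days of saved labor under double annotation with the manual dictionary.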

Performance 

Compared to the eventual gold standard, the annotator without 
pre-annotation missed short abbreviations more often, including 
'v' (vomiting), 'uti' (urinary tract infection), and 'mm' (multiple
myeloma). Pre-annotation can capture these short tokens. 
However, the annotator without pre-annotation missed fewer 
long phrasal annotations which require close reading of the text 
such as 'lack of progress in his speech sound development' and 
'decreased active rotation range of motion'. In addition, although 
the time savings were significant, the annotator with pre- 
annotated text tended to allow frequently occurring terms like 
'disease' or 'infections' to remain unmodified, even if there were 
additional qualifying terms like 'autoimmune' or 'hepatitis'. 

Comparisons for statistical significance 

The purpose of performing an ANOVA test on each of the nine
variables was to determine if any of the variance was statistically
significant. We demonstrated that the class entities, tokens, and
CUI/CODE entities were not statistically significantly different, in
most of the set comparisons, when compared to the baseline set.
This indicates that the texts' structures are not so different as to
cause annotation speed differences. In CTAs, sign/symptom
entities are not as frequent as disease/disorder entities and average
only 0.9 entities per file, or 0.32 per 100 tokens. We believe that the
rarity of sign/symptom entities in this corpus did not provide a strong
basis for the statistical significance test and that this is the reason
why the sign/symptom IAA and annotator performance were
statistically significantly different in experiment 1 (table 4).

Another important comparison point is the intra-experiment 
ANOVA calculation for annotator performances. This indicates 
the potential statistical significance of the variance between the 
annotator who had pre-annotation and the annotator who did 
not, between manual and automated dictionary experiments. In 
no category of F-measure was the performance difference statis- 
tically significant to a Bonferroni corrected p value of 0.0001 
(table 4, intra-experiment significance); that is, the pre- 
annotation did not introduce annotation bias. 

Limitations 

A limitation of this study is that annotation time savings and 
potential annotation bias are not tested in the same sub- 
experiments. However, this is mitigated by the careful study 
design and ANOVA tests. Another limitation is the focus on just 
one corpus (CTA) and one source of dictionaries. Although preliminary
results showed a similar pattern for pre-annotation
experiments on a clinical note corpus, further research is needed 
with multiple different clinical corpora. Future studies should 
also experiment with other dictionary sources such as UMLS. 
Finally, cross-corpora pre-annotation experiments have been 
planned with dictionaries generated from different types of clin- 
ical texts. For practical purposes it is a limitation of our proposed 
method that we did not test the pre-annotation value of the dic- 
tionaries based on the number of underlying documents. That is, 
we used a fixed set of 500 documents to generate the dictionaries 
instead of consecutively increasing sets (eg, 100, 200, and so on 
documents). Future research should test if a dictionary based on 
a smaller number of documents would have a beneficial effect.

CONCLUSIONS 

This study evaluated the effects of pre-annotation on annotation 
time and annotator bias in the annotation of disease/disorder and 
sign/symptom entities for an important clinical corpus, CTAs.
The pre-annotated set was created from either an automatically 
extracted or a manually generated dictionary. Time savings were 
statistically significant and present in all of the experiments, 
when the annotator used pre-annotated text. There was no statis- 
tically significant difference in annotator performance or IAA 
between using a manually or automatically collected dictionary 
of pre-annotation sets. Furthermore, pre-annotated text did not 
introduce bias for the annotations. We conclude that either 
manually or automatically generated dictionary-based pre- 
annotation is a feasible and practical method to reduce the cost 
of clinical NER in the eligibility sections of CTAs without intro- 
ducing bias in the annotation process. 